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ABSTRACT 


In this thesis, depth information extraction via ordinal measures for logical pattern 
correlation is investigated, along with matching techniques that translate well into 
hardware implementations. First, the ordinal correspondence metric is evaluated for 
complexity and performance using traditional correspondence matching techniques. The 
results show that the ordinal measures are very robust in stereo correspondence when 
paired with modest error handling techniques. Second, the ordinal measures are applied 
to an efficient dynamic programming matching algorithm to produce high-quality 
disparity maps. It is shown that ordinal measures are demonstrably fast in software and 
that they can be adapted to an extremely fast hardware implementation to produce high 


quality disparity maps for very high resolution images in real time. 
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EXECUTIVE SUMMARY 


Solving the stereo correspondence problem to produce disparity maps offers 
robots an effective means to segment and understand the real world. Disparity maps are a 
three-dimensional (3D) reconstruction of the scene in front of a stereo pair of cameras. 
Significant research has gone to make solving for the disparity map faster and more 
accurate so that they can be implemented in real time for robotics and other commercial 
applications. A robot capable of determining objects based on their spatial information in 
a disparity map can potentially interact with them in much the same way humans see and 


interact with objects. 


The objective of this research was to investigate a matching metric and computer 
vision algorithm, each with acceptable accuracy and potential for efficient hardware 
implementation. The areas of interest are: 

1. The correlation coefficient «'s accuracy in determining matches, 


ee The implemented metric hardware utilization and latency. 


Also, the correspondence matching algorithms were looked at for: 


1. Computational complexity, 
2: Accuracy in determining depth from a stereo pair of images, 
a: Memory requirements. 


For this research, the ordinal correlation coefficient « was evaluated by the 
metric objectives, and a traditional pattern matching algorithm and a dynamic 
programming pattern matching algorithm were evaluated under the correspondence 


matching algorithm objectives. 


The stereo correspondence problem is solved by matching points between a stereo 
pair of images. For every point in the reference image of the stereo pair, a matching point 
is found in the target image of the stereo pair. This match is described by a translation 
vector from the reference pixel in the reference image to the target pixel in the target 


image. All the disparity vectors consolidate into a dense map called a disparity map. The 


Xlil 


disparity map is represented such that the intensity at each point is proportional to the 


magnitude of the disparity vector (proportional to depth) at each point in the dense map. 


The ordinal itself was investigated for its candidacy toward hardware 
implementation. The proposed architecture can be fully synthesized in a field 


programmable gate array (FPGA) and computes the first correlation coefficient « in 


5+] log, n | clock cycles fully pipelined. While an actual implementation was not 


possible for this research due to time constraints, an implementation of only the ordinal 
as a macro function could be sufficient to achieve real-time dense disparity map 
computation. The ordinal satisfactorily met the objectives of performance and for 


candidacy toward hardware implementation. 


The ordinal coefficient « was first evaluated for performance with a traditional 
back-matching strategy. The back-matching strategy is a “greedy” method where the best 
match within a set of possible candidate matches is chosen by maximizing the correlation 
coefficient «. However, the traditional back-matching strategy alone introduces too much 
error from the matching process as a result of distortions inherent to stereo vision (such as 
occlusions, specularity, depth boundaries or projective distortion). To mitigate this, the 
method was improved by ad-hoc error correction. Two such error correction techniques 
were evaluated: “masked averaged hole fill,” resembling a zero-order hold error correction, 
and “masked linear hole fill,” resembling a first order linear interpolation error correction. 


The results were of significantly higher quality after error correction. 


The ordinal coefficient « was next evaluated using dynamic programming, 
producing raw results of much higher quality and accuracy compared to the traditional 
strategy unaided by ad-hoc error correction. The dynamic programming algorithm 
investigated offered greater latitude in computational complexity and memory usage 
optimizations over the traditional technique. The memory requirements were proportional 
to the computational complexity such that an efficient and fast implementation of the 
dynamic programming method could be achieved in an FPGA. The dynamic 
programming method met the objectives of computational complexity, memory 


requirements and accuracy. 
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I. INTRODUCTION 


A. BACKGROUND 


As the field of robotics advances, powerful sensor fusion and processing is 
needed to recognize and interact with objects and the world autonomously in real time 
[1]. There are many avenues for object recognition in robotics such as radar, sonar, optics 
and manipulation [2], [3], [4]. In the natural world, vision provides the majority of spatial 
awareness and object recognition since vision and video are signals carrying vast 
quantities of information about the world. Visual information can be passively obtained 
from cameras, offering improved security, safety and power advantages for robots. Thus, 
image and video processing has been an area of interest for advanced robotic intelligence 


navigation. 


Video processing can be computationally intensive, requiring powerful and costly 
visual systems to process large data streams. Efficient implementation of video 
processing algorithms is necessary to handle the data meaningfully. Furthermore, 
accurate object segmentation often requires sophisticated algorithms that can be difficult 
to implement in real time [5]. An efficient implementation of computer vision algorithms 
with the intent for use in robotics and machine intelligence was investigated in this 


research. 
1. Computational Complexity 


Computer algorithms are analyzed using a special notation that permits a 


generalized comparison independent of exact implementation details. The notation used 


in this thesis for algorithm analysis is the big theta notation O( g) where g is a function 


proportional to the average execution time of an algorithm but not approaching 
asymptotic execution time. The big theta notation accounts for an algorithm’s 
computational complexity (for loops and some lower bound constant multipliers) but 


ignores implementation specific details such as the time needed to execute a particular 


procedure such as adding, multiplying or moving data. For example, an algorithm has 
two for loops iterating through a data set that has n input elements. One of the for loops 


also requires a lower bound constant multiplier c to execute. The big theta notation for 


this algorithm’s complexity would be @(cn’) indicating that the algorithm would 


require at least approximately cn’ time to execute. Also, the big theta notation allows for 


implicit comparative execution time such that O(m+n) = @(max {m,n}). In this case, 


the execution time (computational complexity) of the algorithm is at least proportional to 


the greatest of all the terms in g. The notation can also be used to indicate complexity in 


memory requirements or complexity in hardware implementation [6]. 

Hardware implementations can be tailored to be fully sequential or fully parallel. 
Thus, the computational complexity O( g) in time comes at a trade-off with the 
hardware complexity O( ci ) in instance replication. Given an algorithm with the potential 
for full parallel implementation that takes ©(n) time to execute in software, a hardware 
implementation could instantiate a single instance of the procedure such that it has @(1) 


instantiated (spatial) complexity but requires O(n) time to execute the algorithm. The 
hardware could also have n instantiations of the algorithm procedure such that it has 
O(n) complexity in space but executes that algorithm in @(1) time. Thus, a machine 
could have a theoretical lower bound instantiated complexity @(1) and a theoretical 


upper bound in instantiated complexity of an algorithms worst case execution time 


O(n). This is analogous to transforming an algorithm’s execution in time to an 


equivalent execution in parallel space. 
2. Hardware Implementation 


Complex procedures can be implemented in hardware as macro functions that can 
outperform similar software implementations. For example, a floating point 
multiplication can be accomplished with fixed point multipliers and a software driver, but 


at reduced performance and power efficiency with respect to a floating point macro in 
2 


hardware. For this reason, processors commonly have hardware specific for floating point 
operations. In many cases, functions and procedures have opportunities for efficient 
hardware implementation that can speed up the execution of algorithms dependent on 


such functions. 


In computer vision, algorithms commonly iterate through all the pixels of an input 
image. This gives computer vision algorithms the characteristic burden of dimensionality 


with at least the growth function ©(MN ) where M and WN represent the number of 


rows and columns of the image, respectively [7], [8]. Such a growth function can get out 
of hand quickly, especially if the core procedure is not trivial. For example, if the 
algorithm were to perform only a single floating point calculation, a slow floating point 
implementation that executes in ¢ time grows to a total execution time of at least MNt. If 
t is large and the image has many pixels, a real time execution may not be possible with 
the slow procedure. In the aforementioned growth function, a procedure is executed for 
every row and for every column in each row (nested loops). If the procedure of the inner 
loop (the loop executed the number of times described by the growth function) is not 
trivial as commonly seen in computer vision algorithms, a hardware implementation may 


be necessary for feasibility. 


Hardware implementation can be achieved through a wide range of products as all 
integrated circuits are essentially a hardware implementation of some function or 
procedure. A widely available device for rapid prototyping and hardware implementation 
is the field programmable gate array (FPGA). FPGAs are a “blank slate” integrated 
circuit that can be configured with a hardware function after the device has been 
fabricated. This has made FPGAs useful for assisting algorithms with fast 
implementations of complex procedures that are too slow for software without the costly 


design and fabrication of application specific integrated circuits (ASICs). 


For this research, hardware implementation is assumed to be implemented in a 
typical FPGA, such as Altera, which has unique logic synthesis characteristics which may 
be dramatically different from another FPGA technology. The implementations are 


intended to be synthesized and run at a clock speed of the order of 100 MHz. Thus, one 


S) 


clock cycle is defined as 10 nanoseconds, and a circuit that requires one clock cycle in an 
Altera FPGA (i.e., a Cyclone or Stratix device) must not have any transient activity 
(clock slack) exceeding 10 nanoseconds. When any architecture is proposed, one must 
carefully consider the gate-to-gate propagation delay or critical path, which can be highly 
dependent on a device’s architecture and the combinatorial complexity of the circuit to be 
synthesized. The clock cycle numbers proposed in this research were verified using the 
Altera Quartus II 8.1 software and the Timequest Timing Analyzer slow 1100mV and 85 
degrees Celsius model on the Cyclone family of FPGAs. 


3. Software Simulation 


Hardware and software implementations are far from exclusive of one another. 
Algorithms with potential for hardware implementation often still require management of 
memory and states; also, software implementations of algorithms must always operate 
above at least some hardware. However, some software routines such as dynamic 
memory allocation can be very difficult to implement in hardware. Thus, some software 
routines are not easily synthesizable into a hardware equivalent, but any hardware routine 
can be simulated with software and basic data management capabilities. An illustration of 


synthesizable hardware routines and software routines is shown in Figure 1. 


| ites 
Hardware 
Figure 1. Software and Hardware Venn Diagram. 


On a personal computer, software has access to the fundamental hardware 


building blocks, such as adding and multiplying, needed to construct complex procedures 
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that may or may not be possible in hardware. For the purpose of this research, software 
procedures that have potential for efficient hardware implementation are considered but 
are not actually synthesized. Software simulation of such procedures allows for accurate 


estimation of hardware performance and complexity. 
B. THE CORRESPONDENCE PROBLEM 


In computer vision, the depth of a scene can be calculated by solving the 
correspondence problem between two images. A scene’s depth is a three-dimensional 
process that can be calculated from the distortion between two or more images. A 
computer must first identify points in the different views that correspond in space, a non- 
trivial problem that requires complicated pattern matching metrics and schemes to 
accurately correlate points. Recent research in the correspondence problem has been in 
algorithms that can accurately identify matching points in real time [5]. Real time is the 
equivalent of human perception of time or approximately 30 frames per second. The 
commercial application of the correspondence problem is the generation of disparity 
maps for object recognition and segmentation. The magnitude of the disparity map is 
inversely proportional to the depth of the scene represented by the different viewing 


angles. 


Disparity maps are typically a dense matrix of vectors wherein the matrix 
dimensions are approximately equal to the dimensions of the original images. In a stereo 
pair of images, one image is taken as the reference image, and the other image is the 
target image. A pixel from the reference image represents a point in space and is 
correlated to a pixel in the target image. The horizontal and vertical translation of the 
correlated point from its reference pixel in the reference image to the target pixel in the 
target image is described by a vector. This vector is illustrated in Figure 2. The 
magnitude of each vector in the disparity map is the inverse of the depth at that point in 


the reference image. 


























dx 
ti > 
dy 
3 
Target Pixel 
Reference Pixel 
Figure 2. Disparity Map Vector. 


A disparity map typically is shown as an intensity image where the intensity at each 
pixel is the magnitude of the disparity vector. Since disparity is inversely proportional to 
depth, greater intensity means the correlated point in space is closer to the viewer, and 
lesser intensity is farther from the viewer. The disparity map can be an accurate 


representation of objects in space and can be used effectively for object segmentation [9]. 


The disparity map measures the depth process, and thus, the process of evaluating 
disparity has an associated accuracy. Accuracy in a disparity map is the disparity map’s 
fidelity to the depth process and not the appearance of the disparity map itself. Thus, an 
accurate disparity map will closely sample the depth process. For this research, none of 
the experiments have a ground-truth depth for objective accuracy evaluation. Therefore, 
the accuracy of the disparity map is subjectively determined by comparing object edges in 
the disparity map with the corresponding object edges in the reference image manually. If 
the edges agree, the disparity map is accurate, and if they do not agree, the disparity map 


is not accurate. 
C. OBJECTIVE 


The objective of this research was to investigate a matching metric and computer 
vision algorithm each with acceptable accuracy and potential for efficient hardware 
implementation. The objectives of interest are: 

1. The correlation coefficient «'s accuracy in determining matches, 

2. The implemented metric hardware utilization and latency. 
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Also, the correspondence matching algorithms were looked at for: 


i Computational complexity, 
Ze Accuracy in determining depth from a stereo pair of images, 
3: Memory requirements. 


For this research, the ordinal correlation coefficient « was evaluated by the 
metric objectives (given a particular matching algorithm), and a traditional pattern 
matching algorithm and a dynamic programming pattern matching algorithm were 


evaluated under the correspondence matching algorithm objectives. 
D. C/C++ IMPLEMENTATION 


Since the correspondence problem solution can be very complex, we opted to use 
C and C++ due to our familiarity with said programming languages. Mathworks 
MATLAB was used to evaluate some parts of algorithms discussed in one dimension for 
illustrative purposes. We chose to run all stereo image experiments using C and C++ in 
the Microsoft Visual Studio design environment. C and C++ can implement the complex 
procedures of the correspondence problem and offer a reasonable degree of memory 


control. Many other implementations are feasible but not investigated in this research. 
E. RELATED WORK 


The correspondence problem has been researched heavily in the past using many 
different techniques. An overview of the various methods that have been investigated can 


be found in [10]. 


Dinkar N. Bhat and Shree K. Nayar [11] developed a distance metric « between 
rank permutations that demonstrated robustness in stereo correspondence matching. Their 
research also investigated the performance of « compared to other similar metrics using 


standard traditional pattern matching techniques. 


Changming Sun [12] developed an effective two stage dynamic program for 
window based pattern matching that finds the optimal solution to the correspondence 


problem with minimal computational complexity. 


In other efficient implementation research, traditional matching approaches in [5] 
and dynamic programming approaches in [13] have been investigated. Traditional and 
dynamic techniques such as these have also been looked at for FPGA implementation but 
using different pattern metrics including Hamming distance, sum of squared distances 


(SSD) and sum of absolute differences (SAD). 
F. THESIS ORGANIZATION 


This thesis is organized as follows: computer vision concepts are introduced in 
Chapter II while ordinal measures are defined in Chapter III along with the operations 
necessary to compute the correlation coefficient «. An application of the ordinal to the 
traditional greedy method for solving the correspondence problem is illustrated in 
Chapter IV. Finally, an improvement to ordinal matching using dynamic programming is 
introduced in Chapter V. The conclusions and recommendations for further research are 


presented in Chapter VI. 


Hl. COMPUTER VISION AND STEREO MATCHING 


A. PRINCIPLES OF COMPUTER VISION 


Computer vision is the discipline, similar to image processing and machine 
intelligence, which interprets data from sensors such as radar, sonar and cameras so that 
decisions can be made about the real world [14]. Some sources of computer vision are 
shown in Figure 3. There are many applicable control and guidance problems that could 
benefit from a computer capable of identifying or tracking objects within space. For 
example, an autopilot for a car could be an application of computer vision such that the 
guidance system would track the lines designating lanes in the road and steer the car to 
keep it driving safely within a lane. If the system used a video camera as the primary 
sensory input, this particular problem would require the computer to identify what parts 
of the video sequence were the lines on the road versus the background and other cars on 
the road. Typically, the computer would preprocess the video to accentuate a certain 
feature unique to the object(s) of interest, such as the edges of the lines. The computer 
could then perform thresholding of the pre-processed data to make a ‘yes’ or ‘no’ 
decision as to whether the feature belonged to a line on the road. Often, however, this 
form of segmentation is highly subject to statistical variation (lighting and camera 
parameters, such as gamma) that can disturb the decision-making ability of the computer. 
This is because most segmentation algorithms are forced to make assumptions about the 
video sequence being statistically stationary, which may not always be true. Nevertheless, 
computer vision algorithms endeavor to distill information from raw sensor data to a 


level that a computer can understand and base decisions on. 


Objects 
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Phased Array Radar Sonar Cameras and Video 
Figure 3. Computer Vision Examples—Radar, Sonar and Cameras. 


Decision making for segmentation in computer vision can be taken to a slightly 
higher level using labels to identify objects of interest in space. In the aforementioned 
line and autopilot example, a computer could identify a line against non-line components 
of an image by assigning a ‘1’ to a pixel representing a line and a ‘0’ to a pixel 
representing anything else. This is called a binary image, since all the pixels of the image 


are of the binary set {0,1}. A computer can assign any arbitrary value (not restricted to 


‘1’ or ‘0’) to an object, such that unique objects are given a unique “label” value. 
Fundamentally, the computer will still be restricted to logical ‘yes’ or ‘no’ decisions 
regarding the sovereignty of objects in space, but labeling does give the computer the 
capacity to distinguish multiple objects from one another. This leads to a broader 
application of the computer’s decision-making capabilities. In the autopilot example, a 
line could be distinguished from other objects by means of a threshold, which forces all 
data above the threshold into one category and all data below the threshold into the other 
category. Perhaps, by means of another algorithm, image information could be distilled 
into a variety of categories, with each category having its own label. For example, objects 
can be labeled into categories based on their spatial ordering, increasing order from the 
top left of the image or video sequence to the bottom right. The computer can make 


smarter decisions by testing aspects of each category, or type, of element within the 
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distilled information. Categorical labeling provides a computer more latitude in object 


segmentation while still only using binary decision making. 


As an alternative to the basic edge detection and threshold pre-processing 
algorithm, algorithms that solve the correspondence problem can translate pixel intensity 
values of an image into categorical information unique to objects based on their location 
in space. Traditionally, a computer can segment an object from a video sequence or scene 
as pixels on or around said object change intensity. For example, motion can be detected 
as changing intensities of pixels around objects. Only one video sequence is necessary to 
detect motion of an object. However, given a special video sequence of two separate 
views of the same scene—called a “stereo image sequence,” a computer can potentially 
compute spatial attributes of objects in the scene and distinguish object uniqueness. This 
allows the computer to extract spatial properties of objects that can be categorized and 


labeled. 


Stereo images depict a scene or object in focus from two different angles. The 
focal points of a pair of cameras (such that they form a stereo pair) would ideally 
converge on the object or area in the space of interest. Since cameras sample the three- 
dimensional world and translate it into a two-dimensional representation, each camera in 
a stereo pair distorts the scene in a subtly different way. The distortion between the two 
cameras’ perspectives is inversely proportional to the depth, representing the third axis 


bisecting the angle formed by the focal lengths of each camera (shown in Figure 4). 


Computing 


System 
Field of View 





Figure 4. Stereo Pair of Cameras Observing an Object. 
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Stereo matching algorithms attempt to solve the correspondence problem. The 
correspondence problem is a challenging computer science problem wherein a set of 
points from one image are identified in the other image of a stereo pair. The human visual 
system solves the correspondence problem in real time by matching patterns and edges 
seen from one eye to patterns and edges seen from the other eye [14]. There are many 
different algorithms, but some robust algorithms rely on matching patterns within 
windows around each point of interest. Matches are determined using metrics that 
compute the similarity of a window from the left image to a window from the right 
image. The pair of windows with the greatest similarity is deemed a match and the 
disparity of the points can be computed. Disparity is defined as the translation of a point 
from its location in one image to its location in the other image and is inversely 
proportional to depth. Thus, a point that moves considerably between the two images is 
closer to the stereo pair of cameras than a point that moves only slightly or not at all. If a 
pixel represented an object located precisely at the intersection of the focal lengths of two 
cameras, the point’s translation would be equal to zero and thus represent both zero 
disparity and infinity depth. This effect is reduced as the two cameras’ focal lengths 
become parallel; but, as the focal lengths approach parallel, the sensitivity of the system 
to depth variation is reduced until nearly completely washed out by a very large constant 


disparity and zero field of depth. 


The stereo correspondence problem can be considered in either a general case or a 
specific case. The general case of the correspondence problem makes no assumption 
about the scene other than that there are physically points in the scene that correspond. 
The general case often requires more processing because the search for matches must 
occur in both dimensions of each image to be successful. The specific case makes use of 
known parameters about the cameras and the scene they observe such as the distance 
between the cameras, the focal lengths or the angles between the camera focal lengths 
[15]. This allows for more precise algorithms that can take advantage of triangulation to 
accurately compute the true depth of the scene. The specific case of stereo 
correspondence often employs algorithms that correct the scan lines (the true horizontal 


lines of the images with respect to their rows that make up the pixels in the image) of one 
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image with respect to the scan lines of the other [15], [16]. This allows the computer to 
consider the correspondence problem in only one dimension. For the purposes of this 
thesis, it is assumed that parametric data about the stereo pair is unknown and thus only 


the general case of the correspondence problem will be considered. 


Stereo matching algorithms solve the correspondence problem to create a 
disparity map representing all the depth information between the two cameras [10]. The 
disparity map is essentially distilled information about all the objects in view of the two 
cameras. A computer can process the disparity map by placing the measured disparity 
values into categorical bins representing sovereign objects and their locations. This 
segmentation of objects is thus based on spatial properties rather than intensity based pre- 
processing typical edge detection and threshold illustrated in the car auto-pilot example. 
A disparity map of a real scene will predominantly consist of surfaces, seen as regions of 
constant or constantly changing disparity values. Distortions of disparity information 
within a region of constant disparity often represent an object in front of or resting on a 
surface. Ultimately, the disparity map typically offers a computer more information 
useful for object segmentation than any threshold algorithm, allowing the computer to 
discriminate objects using a labeling technique intrinsic to the level-like nature of 
disparity maps. An example is shown in Figure 5 where the disparity map is on the right, 
and its associated reference image from which it was derived is on the left. The higher 
intensities in the disparity map equate to larger disparity vectors and, in general, the 
closer the objects represented by those vectors. The original image on the left hand side is 


just one of a pair of stereo images (not shown for convenience). 
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Figure 5. Example Disparity Map. 


B. STEREO MATCHING CHALLENGES 


Generally speaking, computers do not respond well to solving problems involving 
far from ideal conditions, a limitation particularly plaguing to the correspondence 
problem. In the best of circumstances, the correspondence problem requires the computer 
to measure a distortion, which, in other words, is the computer attempting to match two 
similar but often non-identical features. This requires the computer’s measuring 
mechanism to have some looseness in determining if features or patterns actually match, 
avoiding false negative matches, while sustaining some degree of discrimination so as to 


minimize false positive matching. 


While the problem is inherently problematic for a computer, computers and 
algorithms must contend with the fact that even miniscule variations between the two 
cameras can severely stunt an algorithm’s matching abilities. For example, if one camera 
has a slightly different focus from the other, it is possible that while the cameras are 
observing the same scene, there will be no matching patterns or features between the 
stereo pair of images. This effect will be observed later and is difficult to mitigate in the 
matching process. Another problem for algorithms solving the correspondence problem 
using statistical metrics for match determination is that matching could become 
problematic if images have varying statistical characteristics as shown in Bhat and Nayar 


[17]. 
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Stereo images themselves introduce distortions as a result of sampling a three- 


dimensional object or scene as a two-dimensional image. For this thesis, this type of 


distortion or error will be referred to as occurring outside the system and cannot be 


mitigated but can potentially be interpolated as a result of spatial correlation. There are 


four types of distortion: depth discontinuity, occlusion, specular reflection and projective 


distortion: 
1. 


Depth discontinuity commonly occurs when the normal of a surface is 
very close to perpendicular to one camera’s focal length but still visible 
and completely unobserved by the other camera. Surfaces that create depth 
discontinuities contribute to the non-linear nature of the disparity map. 


Occlusion occurs when an object or surface in the foreground covers 
objects or surfaces in the background so that the background behind the 
object appears differently to either camera. Thus, no matching points exist 
in the region of occlusion when an occlusion occurs. 


Specular reflection causes a pattern to occur in differing locations in each 
image while not being consistent with the true depth profile of the scene 
such as a mirror. 


Projective distortion, the most common form of distortion in stereo 
matching and partially touched on before, occurs as a result of sampling a 
three dimensional object or scene as a two-dimensional image. 


These distortions are illustrated in Figures 6 through 9. 





Figure 6. Example of Depth Discontinuity. 





Figure 7. Example of Occlusion. 
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Figure 8. Example of Specular Reflection. 





Figure 9. Example of Projective Distortion. 


Distortions frequently introduce matching error that will need to be compensated 
for. Earlier implementations of stereo matching algorithms handle distortion in hindsight 
as though the information did not exist. More advanced algorithms overcome error by 
finding the optimal solution (match) even when a good match does not exist [18]. Both 


implementations will be investigated in detail in later chapters. 
C. DISPARITY MAPS AND OBJECT SEGMENTATION 


The disparity map is inversely proportional to the depth observed by the stereo 
pair of cameras. A disparity map will be referred to in this thesis as a solution and is 
commonly very dense, having a solution for the translation of every point in an image to 
every corresponding point in the other stereo image. Disparity maps are made up of 
vectors in Cartesian coordinates but are rendered as an intensity image such that the 
magnitude of the vectors at each point is proportional to the intensity. For example, if an 
arbitrary point P in Image J were to move three pixels right and one pixel down to P’ in 
Image 2, the intensity of the disparity map at that point would be proportional to the 


magnitude 3.16 as demonstrated in Figure 10. 


16 





























Image 1 Image 2 


Figure 10. Example of a Disparity Vector. 


Because natural images generally have high spatial correlation in two dimensions, 
this generalization applies to the third dimension measured by stereo correspondence 
algorithms. As a result, objects commonly have similar or consistent disparity vectors. 
Computers can take advantage of this spatial correlation to segment objects based on the 


magnitude of the disparity vectors as depicted in Figure 11. 
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Figure 11. Example Stereo Pair and Disparity Map with 
Distinctive Object. After [19], [20] 


Notice that the object in the foreground can be distinguished from the 
background because it is represented by a region of consistent disparity vectors. Unlike 


edge detection based methods that use intensity values of pixels of an image, the object 
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is spatially segmented from its surrounding using the disparity map because the 


information in the disparity map is directly related to depth. 


Another feature of disparity maps is that the precision of a disparity map increases 
as the resolution of the reference images (the stereo pair) increases. Higher resolution 
images can have objects consisting of more pixels and patterns. Thus, more pixels means 
that while the location of an arbitrary point is constant in space, the number of pixels 
separating points P and P’ representing the point increases with the resolution of the 
stereo pair as shown in Figure 12. Subsequently, the disparity vectors become 


proportionally more precise. 





Figure 12. Comparison of Low and High Resolution Stereo Images. 


In the next chapter, the ordinal coefficient « is introduced. The ordinal is a 
logical metric that has advantages for hardware implementation. A proposed architecture, 
also introduced in the next chapter, computes the correlation coefficient « in parallel and 


fully pipelined on an FPGA. 
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HI. ORDINAL MEASURES AND THE CORRELATION 
COEFFICIENT K 


A. WINDOW COMPARISON METHODS 


Area based matching algorithms use windows to match corresponding points 
between two stereo images. Ideally, in a stereo pair with infinite spatial resolution, a pixel 
alone would be uniquely identifiable and, thus, can be matched from one image to 
another. However, since real pixels have finite spatial resolution, they do not carry 
enough information to be matched without additional pattern-based information of 
the neighborhood surrounding a pixel. Thus, point-by-point matching for stereo 
correspondence in the time domain becomes pattern matching with window based 


methods, matching patterns around points shown in Figure 13. 





Figure 13. Defining a Pattern using a Window around a Point in the 
Reference Image. After [19] 


Window based methods generally compute a coefficient proportional to the similarity 
of patterns between two windows [10]. This coefficient is called a correlation coefficient. 
Thus, finding a “best match” is an optimization problem where a match is determined by 


maximizing (or minimizing) a matching method’s correlation coefficient. Often in images 
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with very high signal-to-noise ratios (SNR), simple optimization algorithms are enough to 
determine matches. Also, window based methods generally rely on a standard pattern 
matching technique. First, a reference point is chosen sequentially from the reference image 
(perhaps the left image for demonstrative purposes). A window is then built around this point 
so that a reference pattern is used to represent that point. Next, the target point(s) are chosen 
within a search range of the target image (the right image in this example). Windows are also 
constructed around these target points (overlapping allowed) to form representative patterns 
around each target point. Finally, the window based correlation computation is applied to all 
target patterns (windows), comparing each target pattern with the reference pattern. The 
process is illustrated in Figure 14. A match is determined by maximizing the correlation 
coefficient among all possible matches. The pattern based window matching is analogous to 
matching the individual points or pixels of the two images provided that the windowing 


scheme is consistent. 
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Figure 14. Pattern Matching Technique. 


Window matching methods use search ranges to identify the best match among a 
candidate region of possible matches. Often, search ranges account for the majority of 
computational complexity in any pattern matching technique. Search ranges manifest as 
deeply nested loops or loops within loops. For example, for every reference point in one 


image, the algorithm will need to find its best match within a search range of five rows 
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and fifteen columns. This equates to 75 simple match calculations per pixel in the 
reference image, which can be a staggering increase in computational time even for small 
images. Often, simple matching calculations are not enough and more complex matching 
techniques are needed and will be explored in depth later. Search ranges also intensify the 


computational issues of the chosen correlation coefficient. 


The correlation coefficient computed for a window appraises the similarity of the 
patterns within the pair of windows. Many of these measures use statistical means to 
compute the correlation coefficient such as the sum of absolute differences (SSD) given 


by 


ssp = (1-15) (1) 


i=l 


or the normalized correlation coefficient (NCC) given by 


n 


D(A -F)(2-4) 


NCC =——® (2) 


(E(1-1) [Ete-2) 


i=l i=l 








Notice that the computations for Equation (2) are not trivial. Other metrics use filtering 
techniques to compute the similarity to a reference window at each point in the target 
image. These methods are subject to outliers in pixel intensities such as an image pair 
corrupted with salt and pepper noise. Other measures such as the ordinal are dependent 
on the ranking of pixels and their relative ranking to each other. Ordinal metrics are very 
robust to statistical variations between the two images such as gamma and contrast issues 
but can be frustrated by additive white Gaussian noise (AWGN) as shown in Bhat and 
Nayar [11]. Statistical measures can be very computationally expensive, requiring many 
multiply and add operations per window comparison. The measure of interest for the 
research conducted is the ordinal metric that uses the rank permutations of pixels in 


windows to compute the correlation coefficient x. 
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B. ORDINAL MEASURES WITH K 


Ordinal measures are used to compute and compare the rank permutations of 
candidate windows to compute the correlation coefficient«. The ordinal « has been 
shown to be very robust in comparison to statistical correlation metrics such as SSD and 
NCC [11]. The ordinal also has computational advantages for hardware implementation 


and robustness in choosing matches that are investigated in later sections. 


The goal is to define a measure of correlation between the two sequences which is 
least sensitive to random variations of individual values and, at the same time, easy to 


compute on sequential and parallel machines. 


This leads to an approach based on the ordering of the values rather than the 
values of the sequences themselves. For example, take three sequences of length three as 


x, =[2.5,—-3.2,1.7], x, =[1.9,-5.1,0.8], and x, =[0.1,2.5,0.9]. One can see that, by the 
ordering of the values, x, and x, are similar, while x, and x, are not since, in this case 
x, (2) < x, (3) < x, (1), x, (2) < x, (3) < x, (1), and x, (1) < x, (3) < x, (2). It can be seen that 
the sequences x, and x, have the same “ordering” of values while x, is different. 

In this section, the basic theoretical framework of definitions and metrics is 


introduced so that the similarities of sequence ordering can be assessed. In particular, the 


following is defined: 
Definition 1: For any integer n call S, the set of all permutations of the indices 
1,2,..1. For example, S, = {(1, 2,3), 1,3, 2), (2,1, 3), (2,3, 1), (3,1, 2), (3, 2,D}. Clearly, the 


“cardinality” of S, (i.e., the number of elements in the set) defined as |S 
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as: 


|S, =n! (3) 


Definition 2: Given a set of index permutations S,, call 7 =(z',z’,..,.2z")ES, 


an element of the set. For example, (again for n =3) choosing z = (3,1,2) corresponds to 
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nm =3,0 =17 =2. By the very nature of these definitions, each permutation 7 €S,, 





can be seen as a mapping between the indices (1,2,..,7) and the ordering is 


m:(1,2,...,.n) >(2',2’,...,2"). (4) 


Any sequence x of real numbers can be converted to its associated permutation 


m by 


a! =I (x(i)>x(/))+ D4 (s@=x(i)). (5) 


where J(B)=1 if the expression B is true, /(B)=0 if the expression B is false, and i 
is the specific sample in the vector x being evaluated. 
An example of transforming two data sets x, and x, into each respective rank 


permutation z, and 7, is shown in Figure 15. 
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Figure 15. Example Rank Permutation Conversion. 


This leads to the definition of inverse permutation as follows: 
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Definition 3: Given any permutation z €S,, the inverse permutation z'€S, is 
defined as the mapping 


3 (0', yur") > (12,0057) (6) 


where z' =k and (x7) =i for i,k =1....,n. 

For example, again in S,, the permutation z= (3,1,2) can be seen as the 
mapping z:(1,2,3) > (3,1,2). Its inverse is the mapping nm : (3,1,2) > C,2,3) which, 
after reordering, becomes z' : (1,2,3) > (2,3, 1). 

This leads to the definition of the positive and negative identity permutations 
I, =(1, Deas) and [_ =(n,n-1,...,2,1). The identity permutations have the property 
that the inverse of each identity permutation is itself such that ig =I, and T'=1.An 


example of the processing of the inverse permutation is shown in Figure 16. 
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Figure 16. Example Inverse Permutation z,’. 


Definition 4: Any two permutations can be combined into a single permutation s 
such that se, where s defines a permutation z, mapped to z,' and re-ordering the 


indices of s as 1,2,...,n. The formulation is given by 


s:((a') (ar) .-(a')) (#22) 7) 


; i 
where s' = 7} and k =(z;") ; 
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For example, given 0! = (2,31) and z, = (13,2), then s:(2,3,1) > (1,3, 2) and re- 
ordering the indices of s as 1,2,...,n, we get s= (2, 1,3). An example of computing the 


combined permutation s from z,' and z, is shown in Figure 17. 
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Figure 17. Example Combined Permutation s. 


Definition 5: A distance vector d,, is the relationship between s and the identity 
permutation 7,. When s=TJ,, the distance is zero such that d,, = (0, 0,...,0) , Indicating 


a ‘ : : : n 
perfect positive correlation, and when s=J/_, d, achieves a maximum distance of H : 


indicating perfect negative correlation. Finally, d' 


m 


di, =i- YI (si <i)=YI(si>i) (8) 


is defined by 


where J (B) =1 if the expression B is true and J (B) =O if the expression B is false. 


For example, given s = (2; 1,3) ,then d= (1,0,0). 


Definition 6: A normalized correlation coefficient « is the measure of correlation 
between z, and z, taken as the maximum distance in the d,, vector. If the input random 
sequences are perfectly correlated, « takes on the normalized value 1. If the input 
random sequences are perfectly un-correlated, K takes on the normalized value —1. The 
coefficient « is defined as 


2max’.,d, 
ea ae (9) 


ZS 


where z, and z, are permutations to be compared. Finally, the overall flow chart of data 
from sample data windows x, and x, to computing the correlation coefficient « is 


shown in Figure 18. 
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Figure 18. Ordinal Processing Flow Chart. After [11] 


C. HARDWARE IMPLEMENTATION ADVANTAGES 


Computation of the correlation coefficient « has significant advantages for 
hardware implementation. First, the coefficient « can be computed in a purely logical 
fashion requiring no summations or multiplication. Thus, in a hardware implementation, 
only basic logic slices of comparators and multiplexers are needed to realize the ordinal 


measure. For illustrative purposes, the hardware implementation will be explored. 
1. Rank Permutations 


The first component needed to compute the correlation coefficient « is to 
compute the rank permutations of each window. In software algorithms, rank 


permutations can be calculated quite quickly, especially when using the counting sort 


algorithm. Computation of rank permutation is inefficient with O(n’) complexity. With 
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hardware, the naive sort’s worst case scenario complexity is taken and spread out 
spatially to compute the rank permutation in full parallel to achieve computational 


complexity @(1) in time, or one clock cycle (given a typical sized input of n 8-bit pixels 
to be sorted), but O(n’) in space. For stability in the rank permutation, the hardware 


requires an additional O(n’) instantiated complexity but with no additional 


computational time. The architecture for computing the rank permutation of a two pixel 


window is shown in Figure 19. 
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Figure 19. Rank Permutation Architecture. 


The rank permutation circuit shown in Figure 19 needs eight comparators to 
compute the ranking of two pixels. The input to the circuit is each 8-bit pixel intensity in 
the window to be ranked. Each comparator outputs a logic ‘1’ if the inputs satisfy the 


Boolean expression in the comparator. The ranking of a pixel at z} is the summation of 


the number of pixels in the window less than the i” pixel plus the number of pixels equal 
to the 7” pixel. The summation is carried out by a logical “ones counter” circuit rather 
than actual adders. For very large bit vectors (over 128 bits), the ones counter circuit can 


Zz 


be realized as a sum-of-products or product-of-sums combinational circuit to minimize 
delay. Also, the equal comparators are masked to only be capable of summing up to 7. 
This ensures stability in the rank permutation by assigning rankings of equal intensity by 
the order of the pixels in the window. The output of the rank permutation is a vector of 


ranks encoded in binary coded decimal. 


Computation of correlation coefficient « requires two rank permutation 
components. The rank permutation hardware makes up the majority of the metric’s 
hardware implementation logic requirements. Because of the 2n’ instantiated complexity 
of the architecture, larger windows of nine by nine or 11 by 11 require significant logic to 
fully implement. With the proposed architecture, computation can typically be handled in 


one clock cycle for nine by nine or 11 by 11 sized windows of pixels. 
2. Inverse Permutation and the Combined Permutation s 


The inverse permutation is tricky to compute in hardware. The simplest 


implementation of the inverse permutation is to define two registers, one n sized register 
and one n’ sized register. For each rank, a set of n selectors compare the ranking at z/ 
with a set index 7. If the ranking is equal to the index, the rank is multiplexed to the new 
location by z; and all other rankings in the target portion of the register are set to zero. 
The first register holds the rank permutation, and the second register holds an expanded 


form of the inverse permutation. The expanded inverse permutation is to ensure that flip- 


flops in the register are never driven by more than one multiplexer. The expanded inverse 
permutation is condensed by logical or from n* to n. A proposed architecture for 


computing the inverse permutation for a window of two pixels is shown in Figure 20. 
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Figure 20. Inverse Permutation Architecture. 


Next, the combined permutation is easily calculated by multiplexing the non- 
inverted rank permutation with the inverted permutation as the control. The proposed 


combined permutation architecture is shown in Figure 21. Note that the inverse 
1 2 
permutation elements (2; ') and (x, ') can each only take on values [1. 2] unless a bit 


error has occurred. The inverse permutation is directly ported into n selectors that are n 


wide, passing the associated rank to the output with respect to the i” index of the inverse 


permutation. 
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Combined Permutation Architecture. 


Figure 21. 


3. Distance and Correlation Coefficient « 


The most complex part of calculating the correlation coefficient « with hardware 
is computing the distance vector. For this, a property of the distance calculation is 


exploited to parallelize the computation. The distance from the identity permutation is 


graphically illustrated in Figure 22. 
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Figure 22. Distance Scatter Plot. 
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The distance vector between the combined permutation s and the identity 
permutation is composed of the number of ranks bounded by the region defined by 


(x 20,x<i,y>i,y< n) or the region defined by ABCD in Figure 22 (ranks falling on 


the bottom boundary are not counted). Since the s permutation is compared to the 


1 


identity permutation /,, the inverse permutation s represents the ranks out of place 


with respect to 7, but not bounded by the region. Thus, the number of ranks in s that 
occupy this region is equal to the number of ranks out of place up to index i minus the 


number of s' ranks out of place up to index i. Thus, the distance vector is composed of 


two separate vectors which are defined as 


tise i (10) 


pos 


and 


cee Calera (11) 
, ‘ ksi ; 

D> Sie (12) 

Equations (10) and (11) are graphically illustrated in Figure 23 using two scatter 

plots of the s permutation and the s' permutation. The J vectors can be joined in 


parallel using Equation (12). The parallel merging of the J vectors is graphically 


illustrated in Figure 24. 


31 









































































































































A Identity line A Identity line 
7+ e = Tit e = 
) ) 
6-- ° 6-- : 
5-++ e 5+ e 
44- e 4+ e 
3+ © 3+ e 
Dials e 2-- e 
1-+ e 1+ e 
Lt a [ood hae re ok el 
123 4 5 6 7 123 4 5 6 7 
Jpos =[1100100] Jneg = [0010011] 
Figure 23. Computing the J Vectors. 
Jpos Jneg dm 

1 0 } mt 

1 0 |__| 2 

0 1 |} 1 

0 0 > 1 

1 0 Pm 2 

0 1 my 1 

0 1 > 0 

Figure 24. Calculating the Distance Vector in Parallel. 


In hardware, the summations of the two J vectors is trivial since any element in 


either vector is {0,1}. Thus, the summation is logically a ones counter, which can be 


implemented without the use of adders or a clock. A proposed hardware implementation 


is Shown in Figure 25. 
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Figure 25. Distance Hardware. 


Finally, « is computed by taking the maximum of the distance vector using a 


maximums array circuit that logically selects the greater d,, of a pair iteratively by stages 
until the maximum is found. The maximums circuit requires | log, n| stages to compute 
where each stage computes a partial maximum with each clock cycle. The « coefficient 
need not be normalized to fall between [—1,1] since this would require floating point 


computations. Instead, the raw maximum distance is used, and the optimization of « 
occurs when the maximum distance is minimized rather than maximizing the normalized 


coefficient. 


The proposed implementation suggests one clock cycle for both rank 
permutations, two clock cycles for the combined permutation, two clock cycles for the 


distance vector and | log, n| clock cycles to compute the maximum distance for a total 
latency of 5+ | log, n | at a clock frequency of 100 MHz (clock period of 10 


nanoseconds) in an Altera Cyclone family FPGA. The proposed architecture is also fully 
pipelined and can compute a result every clock cycle when the pipeline is full. The 


proposed architecture for a nine by nine ordinal (n = 81 and operating on 8-bit pixels) is 
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synthesized using the Altera Quartus II 8.1 software for several Altera devices and the 
logic utilization is tabulated in Table 1| to illustrate the metric’s logic requirements. The 
ordinal of size nine by nine could feasibly be implemented in an Altera Cyclone IT FPGA, 
an economical FPGA, while reserving the FPGA’s multipliers and most register 
resources for other purposes such as image re-sampling to form Gaussian pyramids which 


are explained later. 


Table 1. Logic Utilization of Ordinal « in Popular Altera Devices. 























Device Combinational Registers Multipliers | DSPs 
ALUTs 

Cyclone II 12,065 / 35,000 | 442 0 / 70) N/A 

(EP2C35F672C6) (~34%) (0%) 

Cyclone III 12,065 / 120,000 | 442 0 / 576) N/A 

(EP3C120F780C7) (~10%) (0%) 

Stratix Ill | 7,986 / 113,600 | 480 / 113,600 | N/A O / 384 

(EP3SL150F1152C3ES) | (7%) (< 1%) (0%) 





Now that the ordinal measure has been introduced, in the next chapter the 
correlation coefficient « is applied and a basic matching algorithm that uses a traditional 


greedy method to solve the correspondence problem is investigated. 
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IV. ORDINAL MATCHING WITH TRADITIONAL MATCHING 
STRATEGY 


Solving the correspondence problem has been studied extensively in computer 
science for the last four decades [10]. In traditional matching strategy, stereo matching 
algorithms applied a “greedy” method for choosing correlated windows. A reference 
window is chosen from the reference image and passed over a region in the target image, 
computing the correlation values of the reference window to a set of candidate windows. 
While variations exist that impose certain constraints, the fundamental technique is to 
choose the match in the target image region with the maximum correlation value. At the 
time of the ordinal’s development, a traditional greedy method prevailed among window- 


based approaches to solving the correspondence problem. 
A. TRADITIONAL MATCHING STRATEGY 


Traditional matching strategies use several techniques to enhance the greedy 
method. Many of these techniques apply constraints to the matching process to reduce the 
chance of a false match. Some research has applied probabilistic methods to decide the 
likelihood of a true match, while other approaches have applied Kalman filtering to help 
reduce false matches [21], [22]. However, these techniques make assumptions about the 
depth information between the stereo pair as a linear process that can be modeled. The 
depth process behaves somewhat linearly in image regions with consistent intensity 
values but can behave very non-linearly along edges and sharp intensity changes common 
of depth discontinuities. For example, Kalman filtering naturally results in blurring of the 
depth wherever a depth discontinuity exists. Thus, stereo correspondence algorithms can 
are hampered by techniques that attempt to improve matching under certain conditions 
but do not adequately consider the depth process as a whole. For this research, strategies 
that make as few assumptions as possible about the depth process are considered, and 


their computational complexity investigated. 
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1. Multirate Matching With a Gaussian Pyramid 


Some traditional matching techniques are ubiquitous among nearly all stereo 
correspondence algorithms. Multirate processing, in the sense of processing signals at 
different resolutions, has been shown to both greatly improve matching while also 
reducing computational complexity [10]. Multirate entails low-pass filtering the stereo 
pair and down sampling the pair by a integral rate a certain number of times to form a 
pyramid structure of coarse to fine resolutions. In window based methods, multirate 
matching offers more image intensity information at each coarser level while maintaining 
a constant window size at reduced computational complexity. Matching using a coarse- 
to-fine scheme gives matching algorithms a “‘first-look” of the depth process without 
varying the size of windows. The matching process then proceeds to finer resolutions 
(larger dimension images) using the solution of the previous coarse resolution as a 
constraint. Multirate takes advantage of the spatial correlation between the different 
resolutions’ solutions and rules out matches in subsequently finer resolutions that are 
determined to be too far in disagreement from the coarse resolution solution. A Gaussian 


pyramid is illustrated in Figure 26. 
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Figure 26. Typical Gaussian Pyramid. After [19] 
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Before down-sampling an image, it is necessary to low-pass filter the image to 
reduce aliasing effects. Aliasing can introduce artifacts that distort the matching process 
by attempting to match the aliased artifacts rather than the valid image information. Also, 
the aliasing from down-sampling one image in a stereo pair can be substantially different 
from the aliasing and information in the other image of the stereo pair. A common 
method in stereo to reduce this aliasing is to filter both images with a two-dimensional 
Gaussian pulse type low-pass filter. This pulse is defined by 

(ext ex) 
io? 


K(n,,n,) =e (13) 


The standard deviation is arbitrarily chosen to reduce artifacts. For this research, a nine 


by nine pulse kernel K(n,,n,) (filter window where N is the dimension of the window) 


with a standard deviation of 0.7 was used with a down-sampling rate of two such that 
each coarser image is successively half the width and half the height of its finer 


counterpart. The pulse and frequency response is shown in Figure 27. 
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Figure 27. A Gaussian Pulse (9x9) and its Frequency Response. 


The pulse is convolved over the whole image, zero padding the image edges to 


reduce edge effects. After filtering the image, pixels are extracted at the specified rate in 


af 


the horizontal and vertical directions to form a lower resolution image. The two- 
dimensional convolution operation is defined by 


y(nin,)= YY 1(ksk,)K (1, ~kys, —ky). (14) 


k, =—00 ky =—00 


The matching process proceeds from the coarsest to the finest image. At the 
conclusion of the solution of the correspondence problem at a particular resolution, the 
solution is multiplied by two and up-sampled to the next highest resolution using a zero- 
order hold. The choice of matches at the finer resolution is confined to the neighborhood 
of the coarse resolution solution. The initial solution prior to the coarsest resolution is 
assumed to be zero at every point in the reference image. Each finer resolution solution is 
a refinement of the previous coarse resolution solution. Thus, getting the correct matches 
at the coarsest resolution is of paramount importance. A matching error (false match) at a 
coarse resolution often results in the error propagating to all subsequent finer resolutions. 
Loosening restrictions on neighborhood matching helps reduce the impact of an error at a 


coarse resolution. 
Zz. Back Matching Strategy 


The greedy method can be constrained by enforcing the matches in the forward 
direction to agree with matches in the reverse direction (a strategy used by Bhat and 
Nayar [11]). For example, a target match chosen in the right image by the greedy method 
must also match by the greedy method within a certain tolerance in the left image when 
the right image is taken as a reference. This process is illustrated in one dimension in 
Figure 28, which shows two candidate matches to window 3 in the reference vector: 
window 4 and window 6 in the target vector. Window 6 is chosen since it matches back 


to a closer neighborhood. 
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Figure 28. Illustration of the Back Matching Strategy. 


The back matching strategy helps eliminate candidate matches that do not agree in 
the reverse direction. This commonly occurs in occlusion regions where an occluded 
pattern may not exist between the two images. Therefore, the back matching strategy aids 
in preventing occluded patterns from choosing false matches. Back matching generally 
improves the probability of choosing a true match under normal conditions while not 


making any assumption about the underlying depth process. 

The back matching strategy is the core of the traditional algorithm approach to the 
correspondence problem used in this research. Given a random, one dimension signal x, 
as an input to the algorithm and a shifted version of x, corrupted by noise as the other 
input signal x,, the associated calculated disparity d and correlation coefficient « for 


every matched pattern are shown as the output of the algorithm. Figure 29 shows a basic 


block diagram of the inputs and outputs of the algorithm. 
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Figure 29. Block Diagram of Basic Back-Matching Strategy Algorithm. 
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Each disparity d is calculated by rectangular windowing nine samples from the 
reference signal x, and computing the « across the entire target signal x,. The top three 
matches by the correlation coefficient « are recorded. Each top match window is taken 
as a reference and back matched by computing the correlation coefficient «. If the back 
matching calculates the top match in the reverse direction to be within one pixel of the 
original reference window with respect to each window’s first sample, then a disparity d 


is calculated by taking the index of the first sample in the matching window in signal x, 
minus the index of the first sample in the matching window in signal x,. The associated 
correlation « for the match pair and disparity d are recorded with respect to the first 
sample of the window in signal x,. The result is shown in Figure 30. It is clear from the 


example that errors still occur using the back matching strategy since the true disparity is 


+2 for all points. The last eight samples are zero due to edge effects with the size w=9 
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3. Computational Complexity and Optimization 


The computational complexity of window based matching methods can be very 
large. If a reference window has dimension w in rows and columns, then from a 


reference image of M, and WN, rows and columns, respectively, there are 
(M ie w) ‘(N, 3 w) such reference windows in the reference image. Also, using the same 


sized target image and target windows, there are (M,—w)-(N,—w) target windows for 
comparison. If each reference window is tested against all possible target windows, the 
complexity is @((M, —w)(N, —w)(M,—w)(N, —w))where M, and N, are the rows 
and columns of the reference image, respectively, and M, and JN, are the rows and 


columns of the target image, respectively. This complexity is often excessive for a 
normal set of images, and a reference window need not be tested against every possible 


window in the target image. 


Proper optimization of the algorithm significantly reduces the computational 
complexity into a manageable subset of necessary comparisons. First, down-sampling the 
reference and target images reduces the number of rows and columns in each image and 


reduces the number of reference and target windows that need to be compared. Thus, the 
M M 

complexity at any level is o(( a w( a wl = wl Na »)) where r is the 
r r r r 


down sampling rate, w is the dimension of the window and / is the level being 








evaluated. This is due to the fact that an M by N image down-sampled / times by r 


has dimensions ud and ih . Second, applying a search range such that a candidate match 
r r 


can only exist within a search range D around the reference window reduces the 
complexity to only the target windows within the search range D that need to be checked 


for each reference window. The resulting optimized complexity for down-sampling and 





applying a search range is simplified to o((44 w(t w}2)=0(mx/0). 
r 


I-1 I-1 
r 


Ideally, the search range D is chosen to account for the maximum expected disparity 


4] 


(translation of any point in space between the reference and target images). The search 
range is reduced from the finest resolution at the same rate as the down-sampling of the 
images to form the Gaussian pyramid. For example, if a coarse level is half the resolution 
in rows and columns of its next finest level, the search range within the coarse level 
should be only half the search range of that at the next finest level. A search in the target 
image for any disparity greater than the maximum disparity offers no additional 


information useful for constructing the solution for any level. 


Back matching increases the complexity of solving the correspondence problem 
by multiplying the complexity by at least two. This complexity is increased even more if 
a top set U of matches (the matches depth) are tested in the reverse direction with each 


candidate match independently tested in the reverse direction for an overall algorithm 


complexity of ©(MNDU ) where M is the rows of the image, N is the columns of the 


image, D is the search range for each reference window and U is the set of candidate 
matches that need to be back tested for each reference window. In practice, the number of 
top candidate matches that need to be tested in the reverse direction is usually no more 
than three since the majority of successful back matches are resolved in the first or 
second attempt. Failure to find a reverse match by the third attempt usually indicates that 
no match for the reference window exists in the target region. If a candidate match is 
found to be within tolerance in the reverse direction, the remaining candidate matches are 


not tested to reduce computational complexity. 


Another computational complexity of note is the memory requirements of the 
traditional method. Because of the high matching noise as a result of the back matching 
strategy and the need for error correction, the entire solution for a level in the Gaussian 
pyramid must be computed before spatial error correction can begin. Error correction is 
discussed later in the chapter. Thus, in addition to the memory storage of every image 


within both Gaussian pyramids, the solution at every level must also be stored. 


The memory requirement of the Gaussian pyramid of whole images is 4/3 times 
the memory storage of the highest resolution. This comes directly from the geometric 


series of the Gaussian pyramid decomposition where r= 2 for each dimension. Since the 
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solution matrix is of the same dimension of each corresponding level in the pyramid, the 
solution occupies a proportional amount of memory to one of the stereo Gaussian 


pyramids. 
B. ORDINAL LIMITATIONS WITH TRADITIONAL MATCHING 


It has been shown that ordinal measures are very robust for pattern matching. A 
good pattern matching algorithm is the heart of a stereo matching algorithm that seeks to 
solve the correspondence problem for computer vision. Yet, the ordinal only describes 
how matched two patterns are. Thus, untextured or low texture regions of images with 
low frequency content can appear to have no discernible true match to a window based 
pattern matching scheme. This fact suggests that although the ordinal, a window based 
metric, is highly non-linear, it does have a minimum reliable frequency response. For 
example, given two blank images with no frequency content and no features at all, the 
depth map between the two images is undefined. Furthermore, pattern matching can also 
be unreliable in highly uncorrelated patterns. Given a reference and target window with 
no consistent information between the two, the measure of the patterns should ideally be 
perfectly uncorrelated. In practice, uncorrelated inputs can appear to have some matching 


behavior, which can appear to the ordinal as a potential match. 
1. Untextured and Low Texture Image Regions 


Untextured and low texture comparisons can produce unexpected results for 
window based matching schemes. A pattern matching algorithm with a set of windows 
that have no significant information for matching yields inconclusive results. In the case 
of the ordinal, comparing untextured data appears to be a perfect match, which can 


mislead the decision process. An example of untextured inputs x, and x, and the point 


by point comparison (x, (n:n+w-l),x,(nin+ w-1))is shown in Figure 31. 
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Figure 31. Example of Untextured Inputs and Resulting Point by Point x . 


Applying the back matching strategy does not help the matching of untextured 
patterns. This is illustrated in Figure 32 of totally untextured signals and Figure 33 with a 


shifted step where only the step itself has valid data. 
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Figure 32. Back-Matching Strategy on Totally Untextured Signals. 
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The first two values of the « and disparity vectors result from the forward 


matching pass choosing the first three matches in x, since all possible matches from x, 
to x, are equal. In the backward matching pass, each back match chooses the first index 
in vector x, as the top match. Thus, the first disparity is the first sample in x, matched to 
the first sample in x,, and the second disparity is the second sample in x, matched to the 
first sample in x, again. All subsequent disparities fail to find a match because they all 
continue to choose the first index in x, as the top back match but exceed the constraint of 


being within one sample of the original sample in x. 
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Figure 33. Back-Matching Strategy on Shifted Step in Untextured Data. 


The shifted step provides some valid detail for the matching algorithm to match. 
The step is detected so long as the step is within the window. The samples before and 
after the step are unmatched due to lack of significant matchable detail. The algorithm 
correctly determines the step to have translated two samples to the right for a disparity of 


plus two. 
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The basic approach to mitigating the effect of untextured matching is to avoid processing 
reference windows that do not have significant enough details necessary for matching. In 
this research, a basic measure of window information R, is given by 


1 n-2 


n—-2 


R= (u(i)-) where u(n)=x(n)—x(n+1). (15) 





This measure, resembling the variance of the derivative, is applied and thresholded (i.e., 
compared to an arbitrarily determined minimum measure of information). If the 
information in the window is above a minimum measure of information, the matching 
and back-matching algorithm is allowed to proceed. Otherwise, the algorithm skips the 
reference window and records a disparity of zero and correlation coefficient of —1. The 
missing solution is accounted for later by error correction methods. A minimum amount 


of information of 0.05 is chosen for window sizes of 81 samples. 
2. Uncorrelated Regions 


Uncorrelated inputs can be another problem for window based matching methods 
and is directly related to the chosen metric’s discrimination of dissimilar patterns. The 
ordinal performs well in discriminating uncorrelated patterns but does still report 
correlation even when the signals are truly independent. Two possible causes for 


uncorrelated inputs are occlusions and image edge effects. 


Uncorrelated inputs commonly produce a moderate level of correlation but rarely 
are perfectly matched. An example pair of input vectors independently and randomly 


generated is input into the algorithm, and the resulting outputs are shown in Figure 34. 
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Figure 34. Back-Matching on Uncorrelated Inputs. 


Generally, uncorrelated inputs produce a maximum « correlation of 0.5. Thus, a 
threshold of the « values themselves to be at least 0.5 in order to be considered a valid 
match candidate helps reduce the chance that uncorrelated and unrelated patterns do not 
factor into the final solution. This number concurs with tests performed by Bhat and 


Nayar [11] on random dot stereograms with no correlated regions. 
es TRADITIONAL AD-HOC ERROR CORRECTION TECHNIQUES 


Since the greedy method introduces significant opportunity for error and that error 
can propagate and multiply from one level in the Gaussian pyramid to the next, error 
correction at the conclusion of each level is necessary to improve the quality of the final 
disparity maps [21]. Several techniques for error correction are considered including 
preventive techniques such as avoid matching windows with minimal information and 


recovery techniques such as masked average and linear hole filling. 
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1, Window Information, Search Ranges and Median Filtering 


As described before, window information is measured with Equation (13). When 
the algorithm computes the minimum information and determines not enough 
information for matching exists, it stores a correlation value of —1 indicating the need for 


correction. 


Search ranges also help to prevent error by avoiding searches for matches outside 
the maximum possible disparity at any given level. Since patterns can repeat in images, 
limiting the search range reduces the chance of choosing a false match within a repeating 


pattern. 


Another effective technique for mitigating error is to median filter the solution 
using windows of seven by seven or nine by nine. A median filter sorts the data (disparity 
in this case) from least to greatest and picks the median value. Median filters are non- 
linear filters that reduce outlier noise. The median value is stored at the centroid of the 
median filter. Median filter is done after the solution has been computed but before any 


further ad-hoc error correction. 
Zi Masked Averaged Hole Fill 


Masked averaged hole filling corrects the disparity map by determining invalid 
regions in the disparity map and filling the “holes” formed by the invalid data with 
immediately adjacent valid data. The segmentation generates a mask of values that need 
to be correct through spatial correlation. The holes are filled iteratively using a block 
structuring element to propagate valid data into the invalid region and a single erosion of 
the mask using the same structuring element. The process continues until all invalid data 


indicated by the mask is eroded away. Thus, the masked averaged hole fill algorithm has 


a theoretical computational complexity of © ((mv )’) if only one valid pixel exists in the 


entire solution and a lower bound complexity of ©(MN ). In practice, the masked 


averaged hole fill algorithm is close to the lower bound complexity unless the disparity 


map has substantial error indicated by the mask. 
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The invalid data is segmented by the threshold of the correlation value of the 
chosen match for each reference point in the reference image. The threshold is chosen to 
be 0.5 to avoid accepting matches that could have resulted from noise or uncorrelated 
inputs. The invalid data also includes points in the solution that are not matched at all 
because the reference window does not contain enough information. The result is a mask 
with the same dimensions as the solution, assigning a value of ‘1’ to an invalid solution 
and a ‘0’ to a valid solution. The mask is passed to the hole filling part of the error 
correction algorithm, and the output of the algorithm is the corrected solution with only 


valid data. A block diagram of the hole filling algorithm is shown in Figure 35. 
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Figure 35. Block Diagram of Ad-Hoc Masked Averaged Hole Fill Algorithm. 


Filling the invalid regions with valid data resembles a zero-order hold. Valid 
solutions are iteratively repeated until the invalid region is filled. If two or more solutions 
are propagated to occupy the same invalid point, the solutions are averaged. The process 


is illustrated in Figure 36. 
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Figure 36. Illustration of Masked Averaged Hole Fill Algorithm. 
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The masked averaged hole fill spatially corrected solution is passed to the next 


step in the disparity computation algorithm. 
3. Masked Linear Hole Fill 


The masked linear hole fill error correction algorithm is similar to the masked 
average hole fill algorithm, but the valid data propagation is extended to a first order 
approximation. The invalid data is segmented by applying the 0.5 minimum threshold 
and assigning a mask bit of ‘1’ to solutions that need to be replaced with the linear 
approximation. The overall block diagram of the masked linear hole fill algorithm is 


shown in Figure 37. 
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Figure 37. Block Diagram of Ad-Hoc Masked Linear Hole Fill Algorithm. 


The linear approximation is computed in the horizontal and vertical directions 
independently and averaged point by point for each invalid solution indicated by the 
mask. First, the bounding valid solutions around the invalid space are identified. The 
algorithm computes the difference between the bounding solutions and divides by the 
number of invalid solutions plus one representing the linear change between each 
solution and its immediate neighbor. The invalid solutions are filled by accumulating the 
change to the previous solution starting from the left valid solution and end at the right 
valid solution. The same process is repeated for every invalid region in the vertical 


direction. The process is illustrated in one dimension in Figure 38. The implementation of 


the masked linear hole fill algorithm has a complexity of ©(MN +k) where k is the 


number of linear corrections needed. 
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Figure 38. Illustration of Masked Linear Hole Fill Algorithm. 


The masked linear hole fill spatially corrected solution is passed to the next step 


in the disparity computation algorithm. 


D. RESULTS 


The traditional method solves the correspondence problem with a dense disparity 
map, providing a solution vector for every point in the reference image. Applying error 
correction, we discard portions of the dense disparity map and replace the holes with 
redundant data. Thus, the raw dense disparity map produced by the back matching 
strategy can be seen as a sparse matrix of valid solutions, and error correction recovers 
the missing data between valid solutions. The raw solutions alone do not satisfactorily 


meet the objective of accuracy set out in Chapter I. 
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if Raw Back Matching Output 


First, the raw output from the back matching strategy is evaluated. For 
computational reasons, the search range and Gaussian pyramid are used in producing the 
raw output, and therefore, their impacts are not independently considered. The disparity 
maps are heavily corrupted with errors, particularly in regions with low information. The 


raw output using three levels and a search range of +/— 5 at the coarsest resolution and 


+/— 1 at each finer resolution is shown in Figure 39. 








Reference Image Disparity K 


Figure 39. Raw Output of Multiple Images: Ball, Meter 
and Shrub. After [19], [23], [24] 
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The intensity at each point is the magnitude of the vector describing the change in 


x and y for the pattern in the reference image to the matched target pattern in the target 


image. The associated « at each point is shown for illustration. Higher intensity « 
means that the chosen match at that point was a good one, and low intensity or black 


mean that the chosen match was weak or the algorithm did not find a match. 


The uncorrected error corrupts the disparity maps beyond usefulness. Thus, 
additional error correction is necessary at each level for the disparity maps to be 


meaningful for any subsequent image segmentation. 


Ze Spatially Correlated Correction 


a. Masked Averaged Hole Fill With Median Filtering 


The masked averaged hole fill algorithm is applied to the resulting raw 
disparity map solution output from the back matching strategy algorithm at each level. 


The results are shown in Figure 40 using three levels and a search range of +/—5 at the 


coarsest resolution and +/— 1 at each finer resolution. 


53 








Reference Image Disparity Mask 


Figure 40. Masked Average Hole Fill Corrected Disparity Maps of 
Multiple Images: Ball, Meter and Shrub. After [19], [23], [24] 


Like before, the intensity at each point is the magnitude of the disparity 
vector with respect to each point in the reference image pointing to the match in the target 
image. The associated mask is the result of thresholding the correlation coefficients « 
below 0.5, and white pixels indicate where spatial correlation correction was applied. The 
white pixels in the mask are the discarded disparity solutions that were replaced by the 


zero-order hold nearest neighbor approach of the masked averaged hole fill algorithm. 
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One can notice by the mask that significant error exists using the traditional matching 


techniques, and about half of each disparity map consists of spatially correlated values. 


The error corrected disparity maps are cleaner and more consistent. The 
results illustrate that the traditional method can produce disparity maps with enough 
quality for object segmentation. However, the accuracy of the disparity maps is quite 
low. For example, the sign is certainly distinct from the background, but the edges 
determined by the masked averaged hole fill algorithm are quite different from the true 
edges of the sign in the reference image. This effect can also be seen in the significant 
uncertainty around the edges of the ball (as depicted in the ball mask) or the near field of 
the meter mask. Therefore, sharp edges in the disparity map help to segment objects but 


do not indicate accuracy in the disparity map. 
b. Masked Linear Hole Fill With Median Filtering 


The masked linear hole fill algorithm is also applied to the resulting raw 
disparity map solution output from the back matching strategy at each level like the 
masked average hole fill. The difference is the linear interpolation of the sparse disparity 
solution matrix after masking to a dense disparity solution matrix. The results are shown 


in Figure 41 using three levels and a search range of +/— 5 at the coarsest resolution and 


+/— 1 at each finer resolution. 
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Reference Image Disparity Mask 


Figure 41. Masked Linear Hole Fill Corrected Disparity Maps of 
Multiple Images: Ball, Meter and Shrub. After [19], [23], [24] 


Like the masked average hole fill results, the intensity of the disparity map 
indicates the magnitude of the disparity vector with respect to the reference image, and 
the mask indicates where spatial correction was applied (white pixels). The disparity map 
solutions are linearly interpolated by the masked linear hole fill algorithm, and the effects 


can be seen by the slight banding of intensity values around the solutions. 
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The error corrected disparity maps with the linear interpolation produce 
cleaner and more consistent disparity maps when the invalid solutions occupy regions of 
varying depth while having similar impact to the masked averaged hole file in regions of 
constant disparity. Thus, the masked linear hole fill algorithm generally does a better job 
of error correction than the masked averaged hole fill. However, the accuracy is still 
quite low. The edges in the disparity map are blurred and seem to more accurately depict 
the depth process in the ball image, but the masked linear hole fill algorithm fails to 
accurately determine the depth process in both the shrub and meter disparity maps much 


like the masked averaged hole fill algorithm. 
a Computational Complexity and Memory Requirements 


The traditional back-matching strategy is not particularly computationally efficient. 
For the raw results alone, the computational complexity is high even with the Gaussian 
pyramid sub-sampling with a growth function ©(MNDU ) where M and WN are the rows 
and columns of the reference image, D is the search range and U is the set of matches to 


be back-matched. Employing the masked averaged hole fill algorithm could add 


oO (mn )’) between each level in the pyramid. The masked linear hole fill algorithm adds 


©(MN + k) complexity where k is the number of corrections to be made between each 
level in the pyramid. Thus, employing the masked linear hole fill algorithm for error 


correction would have a total computational complexity of @(MNDU +(MN +k)-L) 


where L is the number of levels. Also, the nature of the traditional back-matching strategy 


makes the true computational complexity difficult to predict. 


The memory requirement for the traditional algorithm is approximately three times 
the memory occupied by the original stereo pair. This accounts for both full-sized Gaussian 
pyramids along with full-sized intermediate solutions. Full-sized intermediate solutions are 
needed for the ad-hoc error correction to function properly. Also, the precise memory 
requirements during run-time are unpredictable because the ad-hoc error correction 


depends on the number of invalid solutions indicated by the error correction mask. 
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The computational cost and memory requirements running the traditional 
matching technique with a nine by nine window, three levels and +/—5 disparity at the 
coarsest levels and +/— 1 disparity at each finer resolution are shown in Table 2. All 
experiments were performed on a computer equipped with an Intel Core 2 Extreme 
X9000 processor and 4 GB of available RAM. The peak memory requirements are also 
shown (including the buffering of the stereo pair of images). These measurements were 
taken using Microsoft Windows Task Manager and are shown only for illustrative 


purposes. 


Table 2. Traditional Matching Timing and Memory Requirements. 














Stereo Pair Processing Time Peak Memory Usage 
Ball (256 x 256) 30s 2976 kBytes 

Meter (512 x 480) llls 9684 kBytes 

Shrub (512 x 480) 1278 10584 kBytes 














The processing time is approximately linear with respect to the size of the input 
images, but the memory requirements can vary significantly depending on the amount of 
error correction needed. Also, a significant portion of the processing time (about 95%) 
was dedicated to calculating the correlation coefficient « even with a fully optimized C 


implementation. 


The traditional method was investigated and demonstrates the ability of the 
ordinal to solve the correspondence problem by the greedy method. In the next chapter, a 
more efficient dynamic programming algorithm with better memory requirements is 
investigated. Dynamic programming offers improved matching, complexity and memory 


requirements. 
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V. ORDINAL MATCHING WITH DYNAMIC PROGRAMMING 


In recent years, dynamic programming techniques have been applied to the 
correspondence problem, often producing better results than the traditional greedy 
method of matching [12], [13], [25]. Dynamic programming computes solutions to the 
correspondence problem to optimality given constraints on smoothing and spatial features 
of the depth process. Thus, unlike the traditional method, dynamic programming can 
account for the disparity of a set of points simultaneously before committing to a 
solution. This advantage of dynamic programming reduces the need for error correction 


to produce good quality disparity maps. 


Dynamic programming is often applied to solve stereo correspondence problems 
reduced to one dimension. Those algorithms begin by rectifying the scan lines of each 
stereo image so that the scan lines coincide with the epipolar plane formed by the stereo 
pair of cameras. Rectification requires information about the stereo pair of cameras, 
which means that such approaches are not a general solution to the correspondence 
problem but a specific one. Therefore, a unique method of two-dimensional dynamic 
programming particular to window based methods is investigated that retains general 


solution qualities. 


A. DYNAMIC PROGRAMMING MATCHING STRATEGY WITH THE 
ORDINAL 


Changming Sun [12] illustrated a dynamic programming approach to window 
based correlation methods similar to the ordinal. His algorithm was shown to be very 
efficient, solving the correspondence problem in milliseconds on a typical personal 
computer and producing results comparable to the graph cuts method (very accurate but 
one of the most computationally expensive and dynamic approaches). Sun’s dynamic 
programming method can be adapted to use the ordinal instead of the zero-mean 


normalized cross co-variance chosen for his research. 


The dynamic programming method centers on the construction of a solution 


volume where the disparity map is the surface with the maximum correlation within the 
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volume. While the solution volume can be extended to hyper dimensionality to account 
for two dimensional disparity vectors, the solution volume is instead constructed for only 
disparity along the scan lines (horizontal translation). The consequence of this 
simplification is that as the epipolar plane deviates from the scan lines, the accuracy of 
the matching algorithm falls off dramatically. This can occur if one camera in the stereo 
pair is out of alignment with respect to the other camera rotationally, vertically translated 
or both. For this research, the stereo pair is assumed to be in sufficient alignment that the 
epipolar plane is consistent enough for successful window based matching along scan 


lines. 
ik, Multirate and Sub-Regioning With a Pyramid 


Like the traditional back-matching method, Sun’s algorithm uses multirate to 
reduce computational complexity and improve matching accuracy. The down-sampling 
procedure is simplified from the Gaussian pyramid down-sampling using the follow 
steps: 


1. Given a sub-region of M, x N, pixels and r down-sampling rate, the sub- 
region is divided into non-overlapping r x r blocks. 


De The down-sampled sub-region is the simple average of each r x r block. 

The dynamic programming method proposed by Sun also introduces sub- 
region partitioning to improve computational complexity. Each sub-region is chosen 
to minimize the complexity ©(M,N,D,)where M, is the rows of sub-region i, N, is 
the columns of sub-region 7 and D, is the expected maximum range of disparities 
within the sub- region 7. The partitioning results from a divide-and-conquer scheme 
where an initial sub-region partitioning is refined by merging regions with similar 
complexity and dividing larger regions into smaller regions with less cumulative 
complexity. Therefore, the goal is to divide the level in the pyramid into small regions 


of high deviation disparities and larger regions with low deviation of disparities. The 


sub-region partitioning is known to be less complex than ©(MND) where M and N 
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are the rows and columns respectively of the un-partitioned level in the pyramid and 


D is the maximum disparity range of the level. 


Partitioning the working pair at a level in the pyramid into sub-regions constitutes 
the first step to constructing a volume of solutions. For a particular sub-region 7 taken 
from the reference and target images, the solution volume is calculated by shifting the 


target sub-region through the range of disparities -—d to +d where 


max max ? 


[-d d.,.|=D, with respect to the reference sub-region. For each point in the 


max ’?*** ? “max 
reference sub-region, the associated correlation values at each disparity d are recorded in 


the depth of the volume. The process is illustrated in Figure 42. 
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Figure 42. Illustration of Constructing the Solution Volume C (i jd ) P 
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At each level / where the base of the pyramid is level zero, the optimal surface of 
the disparity is calculated and propagated to the next level /—lin the pyramid to 
constrain the following search to within the neighborhood of the previous target match. 
The disparity map of the previous level /+1 is up-sampled and multiplied by the rate 
used to construct the pyramid. After up-sampling, the disparity map is linearly 
interpolated to reduce discontinuities. The coarse to fine algorithm is illustrated in the 
following steps: 


Le Beginning with the top level, partition the current level / into sub-regions 
of minimum complexity. 


2s For each sub-region, compute and solve the solution volume with dynamic 
programming using the up-sampled disparity map of level /+1as the 
initial search. If the current level is the top, the initial disparity map is 
assumed to be all zero. 


5: Up-sample the level / solution by linear interpolation and multiply by the 
sampling rate. 

4, Return to step 1 if level 140. 

2. Solving the Solution Volume With Two-Stage Dynamic Programming 


After the solution volume is constructed, the maximum surface through the 
volume is computed using a two-stage dynamic program (TSDP). The maximum surface 
is the optimal path based on the maximum summation of correlation coefficients and the 
constraint p on the maximum slope of the surface in either the vertical or horizontal 
direction. If the dynamic program is not given a constraint, it selects the maximum 
correlation values at each point in much the same way matches are chosen in the greedy 
method explained in the previous chapter. The maximum surface is illustrated in Figure 


43 and represents the disparity of the sub-region. 
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Figure 43. Maximum Surface through the Solution Volume. After [12] 


The first stage in solving for the maximum surface is to compute the summation 


of correlation values as a cost function from the solution volume. The summation volume 


is Y(i, j,d) where i and j are the rows and columns of the volume and d is the index 


through the range of disparities for each row and column. The summation volume is 


calculated using the recursive formula 
V(i,j.d)=C(i, jd) +max¥ (i, j-Ld +4). (16) 
titl<p 
The first stage of the dynamic program proceeds by selecting a column by 
disparity slice of the solution volume. Then, the dynamic program computes the optimal 
summation of cost functions from the top row to the bottom row in the slice for each 


disparity. This process is illustrated in Figure 44. 
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Figure 44. Computing the Solution Volume by Slice. After [12] 


The second stage of the dynamic program begins by finding the optimal path 


through the row by disparity slice. The optimal path is the minimum distance recursion 


through the bottom slice of the summation volume Y (e j.d ). The minimum distance 
G(i, j,d) recursion is defined by the recursive equation 


G(i, j,d)=Y (i, j.d)+max{G(i-lj.d +t)}. (17) 

tit<p 
Once the optimal path through the bottom slice is determined, the second stage 
proceeds to the next slice (row) up of the summation volume. Subsequent slices’ optimal 
paths above the bottom slice are constrained to be within p disparity of the previous 
optimal path. The second stage dynamic program is illustrated in Figure 45. Thus, the 


minimum distance recursive algorithm for calculating the optimal path through the i” 


slice of the summation volume is: 


1. Compute the distances using Equation 17 starting from i=1 and 
computing the maximum summation given the slope constraint p . 
2 Record the back-pointer to each maximum fr" (as in Equation 17) cost 
summation. 
After the minimum distance recursion, choose the final maximum cost. 
4, Recurs backward using the back-pointers to form the optimal path. 
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Figure 45. Computing the Optimal Path through the Summation Volume. 
After [12] 


Finally, the optimal paths are concatenated into a surface of disparities. The 
resulting surface is the optimal solution of the solution volume. This solution is up- 
sampled, interpolated and multiplied by two if the solved level is not the base of the 


pyramid to constrain the search range of the next finest resolution. 
B. OPTIMIZATIONS 


The dynamic programming approach to solving the correspondence problem 
allows for more freedom in orienting the problem to be spatially (memory) and 
computationally efficient. While Sun’s implementation of the TSDP is efficient for 
software, further optimizations are necessary to make the algorithm more feasible in 
hardware. Hardware efficiency is maximized when the algorithm uses a constant space 
and computational complexity. This is because hardware must always account for the 
algorithm’s worst-case scenario in computational complexity. Anything less than worst- 
case complexity leaves some hardware logic idle, resulting in an_ inefficient 
implementation. Thus, the sub-region partitioning and construction of the solution 


volume are modified to optimize hardware computational complexity. 
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1. Constant Sub-Region Partitioning 


The sub-region partitioning is essential for breaking down the stereo pair into 
manageable bits and preventing the solution volume from becoming overly ponderous. 
However, variable sub-region partitioning requires varying memory requirements, which 
can cause problems for an efficient hardware implementation. Thus, sub-region 


partitioning is refined such that sub-regions have constant dimensions. 


Optimization of sub-region partitioning for hardware has a couple key differences 
from Sun’s method. First, sub-region partitioning can be adapted to operate on different 
parts of the stereo pair independently. This can dramatically reduce the working memory 
requirements of a hardware implementation since the algorithm must only buffer the 
regions within the reference and target image of interest. Therefore, the original stereo 
pair is partitioned into constant dimension sub-regions, and the pyramid decomposition is 
performed on the partitions and not the original stereo pair. Second, the sub-regions are 
kept to a constant square dimension a (with overlapping edges in the target sub-region to 
account for search range edge effects) so that computational complexity is kept constant. 
The algorithm for constant sub-region partitioning is defined using the following steps: 


1. For the reference image, divide the image into axa non-overlapping sub- 
regions. For the target image, divide the image into the same axa sub- 
regions but with additional padding D to account for search range. 


Z For each sub-region, decompose the sub-region into a pyramid. 

3 Solve the solution volume proceeding from coarse to fine. 

4, Insert the disparity map of the finest resolution into the final solution. 
5 


Repeat step two until all sub-regions are solved. 


The sub-region partitioning is illustrated in Figure 46. In this example, the target 
sub-region includes padding to the left and right. In practice, it is only necessary for the 
overlapping padding in the target to be on the right side if the reference image is the right 
stereo image and on the left side if the reference image is the left stereo image for 


simplicity. 
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Figure 46. Example of Constant Sub-Region Partitioning and Pyramid. 


2. Abstract Solution Volume 


Since the sub-regions are now partitioned constantly, generating a solution 
volume for all possible disparities at each level in the pyramid becomes impractical. The 
cubic shape of the solution volume grows to maximum complexity at the base level for 


nearly every sub-region. Therefore, the solution volume is redefined such that the volume 


complexity is always the minimum possible O(M ND) where M, is the rows of the 


constant sub-region at level / and JN, is the columns of the constant sub-region at level / 


and D is the uniform search range at each level. 


The minimum solution volume is achieved by forming an abstract volume of 
solutions around the solution surface of the previous level. After the previous solution at 
the /+1 level is up-sampled and interpolated, a search range is constructed around each 
point in the up-sampled solution surface to form a new solution volume around the /+1 
surface. This simplification in complexity is intuitive since the solution at level / must 
always be within the neighborhood of the /+1 level solution. The new solution volume 
construction algorithm is defined in the following steps: 

i For each pixel in the reference sub-region, construct a reference window. 


Ze Translate to the within the target neighborhood of the previous solution 
and compute the correlation values for D disparities. 


3: Record the correlation values and associated cumulative disparity d in the 
abstract solution volume. 
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The abstract volume can be efficiently implemented using C++ techniques. The 
important innovation that makes the abstract volume possible is the recording of the 
disparity along with the correlation value and always D disparities for each reference 


pixel. This results in a constant O(M ND) complexity with proportional memory 


requirement. To illustrate this, the abstract volume is compared to the cubic volume in 


Figure 47. 
Maximum Maximum 
Solution a Solution 
Surface ~ Surface 









_t d= 0 Solution 
Plane 


d=0 Solution 








Cubic Solution nate Abstract Solution 
Volume C(i,j,d) Volume C(i,j,d) 
Figure 47. Visualization of the Abstract Volume Compared to the 


Cubic Volume. After [12] 


The abstract solution volume is stable so long as the maximum slope p_ of the 


solution surface does not exceed the search range D. If the slope is greater than the 
search range, the abstract solution has discontinuous regions that cannot be solved using 
the second stage minimum distance of the dynamic program. The discontinuity appears to 


the TSDP algorithm as all paths through the discontinuity having infinity cost. 


C. RESULTS 


1. Raw Dynamic Programming Matching Output 


The dynamic programming algorithm yields disparity maps of good quality 
without any error correction. For these experiments, the number of levels is set to three, 


window dimension is 11 by 11, sub-region dimension is four by four and the search range 
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is four. Due to an implementation issue with the dynamic program search function, the 
sub-region dimension cannot get too large or very little of the image will be available for 
processing due to edge effects. Thus, the sub-region dimension is kept small, leading to 
potential for moderate inter-sub-region error. Also, the large black regions on the right 
and bottom of the images are due to the edge of effects of a weak sub-region 
implementation. The disparity maps produced by the algorithm are shown in Figure 48. 
While it is not obvious, the disparity maps produced by the dynamic program are more 
accurate than any of the disparity maps produced by the traditional back-matching 
method. This is evident in the clearly defined edges of objects such as the meter and the 
sign in the meter and shrub stereo pairs where all implementations of the traditional 


method failed. 
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Reference Image Disparity 


Figure 48. Disparity Maps from Dynamic Programming for 
Ball, Meter and Shrub. After [19], [23], [24] 
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Error correction techniques are still possible on the final or intermediate disparity 
maps. Although not shown, the correlation coefficient of each chosen match is known, 
and either the masked averaged hole fill or masked linear hole fill algorithms can be 


applied to spatially correlate disparity values that are below a minimum « threshold. 
Ze Computational Complexity and Memory Requirements 


The dynamic programming with applied abstract volume and constant sub-region 
partitioning optimizations produced results that satisfactorily met the objectives of this 


research. The computational complexity is a constant O(M iN ,D) where M, is the rows 
and WN, is the columns of the chosen sub-region partitioning. It is noteworthy that this 


computational complexity approaches the theoretical minimum possible for the stereo 


correspondence problem. The memory requirements are proportional to the 


computational complexity since only one solution is stored for O(M iN ,D). Using the 


ordinal and minimum complexity abstract solution volumes, we get the disparity map 
computation times for various images and parameters tabulated in Table 3. All 
experiments were performed on a computer equipped with an Intel Core 2 Extreme 
X9000 processor and 4GB of available RAM. The peak memory requirements are also 
shown (including the buffering of the stereo pair of images). These measurements were 
taken using Microsoft Windows task manager and, thus, are inherently inaccurate and 


shown only for illustrative purposes. 














Table 3. © Dynamic Programming Timing and Memory Requirements. 
Stereo Pair Processing Time Peak Memory Usage 
Ball (256 x 256) 26 s 1924 kBytes 
Meter (512 x 480) 124s 4288 kBytes 
Shrub (512 x 480) 124s 4288 kBytes 











The processing time and peak memory usage are linear with respect to the size of 
the input image pair. The peak memory usage was constant since the memory usage is 


entirely dependent on the size of the sub-region partitioning in this implementation. 
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Also, a significant portion of the processing time (about 95%) was dedicated to 


calculating the correlation coefficient « even with a fully optimized C implementation. 


The dynamic programming method was shown to be effective with the ordinal as 
the correlation metric. In particular, it produced more accurate disparity maps compared 
to the traditional greedy method without the need for ad-hoc error correction. However, 
ad-hoc error correction can still be applied to clean up the minor errors in the dynamic 


programming matching process. 
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VI. CONCLUSIONS 


It was shown that the ordinal is a logical measure of pattern correlation that can 
be synthesized into hardware. Since with all stereo correspondence algorithms, the 
pattern matching measure is at the core of all processing, obtaining a fast metric 


implementation is crucial to achieving real-time stereo processing. 
A. SUMMARY OF THE WORK 


First, the ordinal was investigated for performance with the traditional back- 
matching strategy with ad-hoc error correction. The traditional back-matching strategy 
demonstrated that the ordinal is a capable metric for stereo correspondence. However, the 


traditional back-matching strategy was computationally complex with at least 


oO (MNDU ) growth to support the search range D and reverse matching depth U. Also, 


the traditional matching strategy had varying memory requirements and required at least 
three times the memory buffering of the original stereo pair to support the Gaussian 


pyramid and full-size intermediate disparity solutions. 


The ordinal was then investigated for performance with a more recent window- 
based two-dimensional, two-stage dynamic programming method. The dynamic 
programming produced raw results of much higher accuracy compared to the traditional 
strategy un-aided by ad-hoc error correction. Also, the dynamic programming robustness 
offered the opportunity for computational complexity optimizations. Using these 


optimizations, we found that the dynamic programming method had a constant 


computational complexity of ©(M,N,D) based on the M, rows and N, columns of the 


chosen sub-region partitioning and search range D. This computational complexity is 
clearly less than the complexity of the traditional methods. Additionally, the memory 
requirements were consistently proportional to the computational complexity since the 
dynamic programming algorithm uses no dynamic memory allocation like the traditional 
methods. Thus, the dynamic programming algorithm distinctly outperformed the 


traditional methods in accuracy, complexity and memory requirements. 
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Second, the ordinal was investigated for its candidacy toward hardware 


implementation. The proposed architecture can feasibly be synthesized in an FPGA and 


theoretically computes the first correlation coefficient « in 5 +| log, n | clock cycles at 


100 MHz in an Altera Cyclone family FPGA, fully pipelined. While an actual 
implementation was not possible for this research due to time constraints, an 
implementation of only the ordinal as a macro function with the search and solver (1.e., 
the investigated dynamic programming algorithm) left to software could be sufficient to 
achieve real-time dense disparity map computation. This is because, for all experiments, 
computation of the correlation coefficient « accounted for the majority of the processing 


time. 
B. FUTURE WORK 


This thesis presented a powerful method of solving the stereo correspondence 
problem using a logical metric and low complexity dynamic programming search and 
solution algorithms. The continuation of this thread of research is to proceed to an 
efficient hardware implementation and real-time demonstration of stereo correspondence 
on a pair of video inputs. Also, research can be done to implement a real-time stereo 


processor to help navigate real world situations for robotics. 
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